The Normalized Compression Distance as a Distance Measure in Entity Identification

Authors

  • Sebastian Klenk
  • Dennis Thom
  • Gunther Heidemann
Abstract

The identification of identical entities across heterogeneous data sources still involves a large amount of manual processing, mainly because different sources use different data representations in varying semantic contexts. Until now, entity identification has required either the often manual unification of different representations or the effort of programming tools with specialized interfaces for each representation type. However, for large and sparse databases, which are common for medical data, for example, the manual approach becomes infeasible. We have developed a widely applicable compression-based approach that does not rely on structural or semantic unity. The results we have obtained are promising in both recognition precision and performance.
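
To make the idea of a compression-based distance concrete, the following is a minimal sketch of the normalized compression distance named in the title, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed length. The choice of zlib as the compressor, the helper names `compressed_size`, `ncd`, and `closest_record`, and the sample records are assumptions made for illustration; the abstract does not state which compressor or matching procedure the authors used.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (zlib is an assumed compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings."""
    cx = compressed_size(x)
    cy = compressed_size(y)
    cxy = compressed_size(x + y)  # compress the concatenation of both inputs
    return (cxy - min(cx, cy)) / max(cx, cy)


def closest_record(query: str, candidates: list[str]) -> str:
    """Return the candidate with the smallest NCD to the query (hypothetical helper)."""
    q = query.encode("utf-8")
    return min(candidates, key=lambda c: ncd(q, c.encode("utf-8")))


if __name__ == "__main__":
    # Hypothetical records in two different representations of the same person.
    records = [
        "John Smith, born 1970-01-15, 12 Oak Street, Springfield",
        "Maria Gonzalez, born 1988-11-30, 7 Calle Mayor, Madrid",
    ]
    query = "Smith, John; 1970-01-15; Oak Street 12; Springfield"
    print(closest_record(query, records))
```

Because the distance is computed on raw bytes, no schema mapping between the two record layouts is needed, which is the property the abstract emphasizes. On very short strings the fixed overhead of the compressor can dominate the result, so the values are best read as a relative ranking rather than as absolute distances.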

Similar resources

Effect of Image Linearization on Normalized Compression Distance

Normalized Information Distance, based on Kolmogorov complexity, is an emerging metric for image similarity. It is approximated by the Normalized Compression Distance (NCD) which generates the relative distance between two strings by using standard compression algorithms to compare linear strings of information. This relative distance quantifies the degree of similarity between the two objects....

Normalized Information Distance is Not Semicomputable

Normalized information distance (NID) uses the theoretical notion of Kolmogorov complexity, which for practical purposes is approximated by the length of the compressed version of the file involved, using a real-world compression program. This practical application is called ‘normalized compression distance’ and it is trivially computable. It is a parameter-free similarity measure based on comp...

Nonapproximability of the Normalized Information Distance

Normalized information distance (NID) uses the theoretical notion of Kolmogorov complexity, which for practical purposes is approximated by the length of the compressed version of the file involved, using a real-world compression program. This practical application is called ‘normalized compression distance’ and it is trivially computable. It is a parameter-free similarity measure based on comp...

Cover Song Identification Based on Data Compression

We present a system for cover song identification. Our approach combines chord sequence estimation with a similarity metric called normalized compression distance.

Normalized Distance Matrix Method for Construction of Phylogenetic Trees Using New Compressor - Dnabit Compress

We define a compression distance, based on a normal compressor to show it is an admissible distance. The first theme concerns the statistical significance of compressed file sizes. Only in recent years have scientists begun to appreciate the fact that compression ratios signify a great deal of important statistical information. In applying the approach, we have used a new DNA sequence compresso...

Publication date: 2009